In parts one and two we considered a dataset consisting of bill and flipper length for penguins:
import pandas as pd
import seaborn as sns

df = (
    pd.read_csv(
        "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv",
    )
    .dropna()  # missing data will confuse things
)[['species', 'bill_length_mm', 'flipper_length_mm']]

sns.relplot(
    data=df,
    x='bill_length_mm',
    y='flipper_length_mm',
    hue='species',
    height=8,
    hue_order=['Adelie', 'Gentoo', 'Chinstrap']
)
and implemented a function that could guess the species of a new penguin based on its measurements by finding the penguins with most similar measurements:
import numpy as np

def guess_species(bill_length, flipper_length):
    # calculate distances and add to the dataframe
    flipper_length_difference = df['flipper_length_mm'] - flipper_length
    bill_length_difference = df['bill_length_mm'] - bill_length
    overall_distance = np.sqrt(
        flipper_length_difference ** 2
        + bill_length_difference ** 2
    )
    df['distance to new point'] = overall_distance

    # find closest points and calculate most common species
    most_common_species = (
        df.sort_values('distance to new point')
        .head(10)
        ['species']
        .mode()[0]
    )

    # the most common species is our guess
    return most_common_species

guess_species(45, 211)
'Gentoo'
If you haven’t read parts one and two of this series, or if it’s been a while, then I recommend at least skimming them so that you will be able to jump in here. And as a reminder, all of the pandas/seaborn magic that we are using for data processing and visualization is explained in the Biological Data Exploration book.
Naming things
Before we start with anything new, we should introduce a few specific terms to bring our descriptions into line with machine learning terminology:
the function that we wrote is a classifier, since its job is to start with measurements and generate a label
the output of the function is its prediction (this sounds a bit more scientific than guess)
the collection of penguin measurements that the function uses is the training set
the process of inputting the training set (which in this case we do by simply having the function use our existing dataframe) is called fitting
the training set consists of two parts: the measurements which we call features, and the species names which we call classes. (I know this is confusing since in biology class is a taxonomic level, but we will ignore that for now)
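To make the mapping concrete, here's a rough sketch of how those terms line up with the objects in our code (the variable names features and classes below are just for illustration; they don't appear in the original function):

# the training set is the dataframe of penguins the function already knows about
# the features are the measurements it works from
features = df[['bill_length_mm', 'flipper_length_mm']]

# the classes are the labels we want it to predict
classes = df['species']

# guess_species() itself is the classifier, and the value it returns is a prediction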
It might be interesting at this point to note that although we’ve implemented this function to work specifically for our dataset, the general approach - to predict the class of a new data point, count the classes of the closest points and pick the class that appears most often - seems like it would work for other pairs of measurements. And if we can figure out a way to calculate distances between points in more than 2 dimensions, it might even work for data points with more than two measurements. We will come back to this idea later in this series; for now, back to the function that is in front of us.
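Before we do, here's a tiny preview of why that generalization is plausible - a minimal sketch rather than anything we'll use yet. The Euclidean distance calculation inside our function extends naturally to any number of measurements:

# Euclidean distance between two points with any number of measurements;
# with exactly two measurements this is the same calculation as in guess_species()
def distance(point_a, point_b):
    differences = np.array(point_a) - np.array(point_b)
    return np.sqrt((differences ** 2).sum())

# e.g. three measurements per penguin (the numbers here are made up)
distance([45, 211, 4500], [39, 190, 3700])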
The most interesting bit of terminology is the name of the algorithm that we have implemented in the function - K nearest neighbors. I will use the American spelling for neighbors as that’s how the function name is spelled in various machine learning libraries.
Choosing parameters
The K bit of the name is because there is a parameter that we can tweak in this algorithm: the number of closest points that we want to consider when making our prediction. In the version of the function above we have used 10, but that’s an arbitrary choice, and we can easily rewrite the function so that k is an argument:
def guess_species(bill_length, flipper_length, k):
    # calculate distances and add to the dataframe
    flipper_length_difference = df['flipper_length_mm'] - flipper_length
    bill_length_difference = df['bill_length_mm'] - bill_length
    overall_distance = np.sqrt(
        flipper_length_difference ** 2
        + bill_length_difference ** 2
    )
    df['distance to new point'] = overall_distance

    # find closest points and calculate most common species
    most_common_species = (
        df.sort_values('distance to new point')
        .head(k)
        ['species']
        .mode()[0]
    )

    # the most common species is our guess
    return most_common_species
We can now fairly easily find sets of measurements for which we get different predictions depending on which value of K we pick:
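For example (the exact measurements here are my own illustrative choice, and you may need to try a few points near the boundary between two clusters to see the effect in your copy of the data), a point that sits between the Gentoo and Chinstrap regions can come out differently for different values of K:

# the same point, classified using different numbers of neighbors;
# for a borderline point like this the two answers may well disagree
print(guess_species(50, 205, 1))
print(guess_species(50, 205, 25))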
This raises an obvious question: which value of K should we pick to get the most accurate results from our classifier function? To answer it we need to look in a bit more detail at how we score a classifier. We have already seen one way to do this in part two: we wrote some code to run the prediction function on all of the real-life penguin measurements, then compared the predictions to the true species and asked what fraction of the time they matched:
# generate predicted species for k=10
predictions = df.apply(
    lambda p: guess_species(p['bill_length_mm'], p['flipper_length_mm'], 10),
    axis=1
)

# compare predictions to true species and get the fraction that match
(predictions == df['species']).value_counts(normalize=True)[True]
So it seems like a trivial matter to try this for a bunch of different values of k and just pick the highest score:
for k in range(1, 10):
    # generate predicted species
    predictions = df.apply(
        lambda p: guess_species(
            p['bill_length_mm'],
            p['flipper_length_mm'],
            k  # this will be different each time
        ),
        axis=1
    )

    # compare predictions to true species and get the fraction that match
    score = (predictions == df['species']).value_counts(normalize=True)[True]
    print(k, score)
However, under this scoring scheme, a k value of 1 will always give a perfect score. It’s easy to see why when we think about the test data: because each point we are testing is actually in the dataset, the closest neighbor will always be the point itself. So by setting k to 1 the only point that will be used in the prediction will be the same one we are testing. In other words, we are testing the function by measuring how well it can predict the classes of data points that it has already seen.
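We can check this directly with a quick sanity test - a minimal sketch that just re-uses the function on a row taken straight from the dataframe:

# take a penguin that is already in the dataset
p = df.iloc[0]

# with k=1 the nearest neighbor is that same row (distance zero),
# so the "prediction" simply reads back the penguin's own species
print(guess_species(p['bill_length_mm'], p['flipper_length_mm'], 1) == p['species'])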
In machine learning, this is a big mistake. After all, the whole goal of building a classifier is to get reliable predictions for new points (in our case, new penguins). So to fairly measure the performance, we should test the function by using it to generate predictions for measurements where we know the correct species, but which are not in the training set.
Training and testing
The easiest way to do this is to randomly assign each penguin (i.e. each row in our dataframe) to be used either for training or for testing, but not both. We can do that by adding a new column to the dataframe, which will be ‘training’ 80% of the time and ‘testing’ 20% of the time. numpy can do most of the work for us with its random.choice function:
df['assignment'] = np.random.choice(
    ['training', 'testing'],  # these are the items to randomly choose
    333,                      # this is the number of choices we want
    p=[0.8, 0.2]              # these are the probabilities of picking each item
)
df
       species  bill_length_mm  flipper_length_mm  distance to new point assignment
0       Adelie            39.1              181.0              20.302955   training
1       Adelie            39.5              186.0              16.077624   training
2       Adelie            40.3              195.0              10.344564   training
4       Adelie            36.7              193.0              14.396180   training
5       Adelie            39.3              190.0              13.520725   training
...        ...             ...                ...                    ...        ...
339  Chinstrap            55.8              207.0              10.600000   training
340  Chinstrap            43.5              202.0               7.803204   training
341  Chinstrap            49.6              193.0               5.035871   training
342  Chinstrap            50.8              210.0              12.014991   training
343  Chinstrap            50.2              198.0               0.000000   training

333 rows × 5 columns
and if we do value_counts on the resulting column we can see the actual numbers:
df['assignment'].value_counts()
training 251
testing 82
Name: assignment, dtype: int64
Since the assignments are generated randomly we’ll get slightly different absolute numbers each time we run the code, but the split will always be roughly 80%/20%. These percentages are arbitrary, of course; we could instead choose to use 10% or 30% of the data for testing.
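For example, to hold out roughly 30% of the penguins for testing instead, we would just change the probabilities we pass to np.random.choice (a quick sketch; for the rest of this article we'll stick with the 80/20 split):

df['assignment'] = np.random.choice(
    ['training', 'testing'],
    333,
    p=[0.7, 0.3]  # now roughly 30% of the rows end up in the testing group
)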
Now we can use the new column to split the penguins up into a training dataset and a testing dataset, which we can store in variables for convenience:
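# keep the rows assigned to each group in separate dataframes
training_data = df[df['assignment'] == 'training']
test_data = df[df['assignment'] == 'testing']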
These new dataframes are simply subsets of the original rows. Now we are prepared to do a fairer test of the function. We’ll replace the reference to df with a reference to training_data, which will make sure that the function can only “see” the species of the points in the training set:
def guess_species(bill_length, flipper_length, k):
    # calculate distances and add to the dataframe
    flipper_length_difference = training_data['flipper_length_mm'] - flipper_length
    bill_length_difference = training_data['bill_length_mm'] - bill_length
    overall_distance = np.sqrt(
        flipper_length_difference ** 2
        + bill_length_difference ** 2
    )
    training_data['distance to new point'] = overall_distance

    # find closest points and calculate most common species
    most_common_species = (
        training_data.sort_values('distance to new point')
        .head(k)
        ['species']
        .mode()[0]
    )

    # the most common species is our guess
    return most_common_species
And when we run our testing code, we will only use the test data:
# generate predicted species for k=10
predictions = test_data.apply(
    lambda p: guess_species(p['bill_length_mm'], p['flipper_length_mm'], 10),
    axis=1
)

# compare predictions to true species and get the fraction that match
(predictions == test_data['species']).value_counts(normalize=True)[True]
In this version of the test, the training data and testing data are completely separate - we use one set of data points to train the classifier (i.e. to find the nearest neighbors) and an entirely different set of data points to test it. In other words, we are now asking how well the function can predict the species of a penguin it has never seen before, which is a much closer scenario to real world use.
Let’s now put the new testing code in a loop, and calculate a score for each value of k between 1 and 10:
for k in range(1, 10):
    # generate predicted species
    predictions = test_data.apply(
        lambda p: guess_species(p['bill_length_mm'], p['flipper_length_mm'], k),
        axis=1
    )

    # compare predictions to true species and get the fraction that match
    score = (predictions == test_data['species']).value_counts(normalize=True)[True]
    print(k, score)
Finally, it seems like we have a clear answer: for this particular problem, we get the most accurate predictions with k=5 (i.e. by considering the five closest points when trying to classify a new point), with worse predictions if we make k either higher or lower.
Unfortunately, it’s a bit more complicated than that. The result we have obtained is only true for the specific split of data into training and testing subsets. This will be obvious if you run the code yourself - chances are, you’ll get a different result from mine above.
Remember: there’s nothing special about the particular split we have obtained here - we got it by randomly assigning each row to either the training or test data. To illustrate this, look what happens when we repeat the splitting process ten times and then calculate the score for each different random split. Notice that in this loop we are keeping k fixed at 10, and just doing the same experiment repeatedly:
for _ in range(10):
    # randomly assign the rows to either training or test
    df['assignment'] = np.random.choice(['training', 'testing'], 333, p=[0.8, 0.2])
    training_data = df[df['assignment'] == 'training']
    test_data = df[df['assignment'] == 'testing']

    # calculate score
    predictions = test_data.apply(
        lambda p: guess_species(
            p['bill_length_mm'],
            p['flipper_length_mm'],
            10  # k is always 10
        ),
        axis=1
    )
    score = (predictions == test_data['species']).value_counts(normalize=True)[True]
    print(score)
The scores vary a lot, from a near-perfect 97% accuracy to a much less impressive 91% accuracy. Thinking about it logically, we can see why this must be the case. In some iterations, due to random chance, we will get a training set and test set that are very similar, in which case the function will perform well. But in other iterations, the training and test sets will be very different, leading to a lower score. If we are very unlucky, we might get a split where most of the Gentoo penguins end up in the training set, but most of the Adelie penguins end up in the test set. Because our function simply counts species among the nearest training points, a training set with few Adelie penguins means it will very rarely predict Adelie; and since the test set is full of Adelie penguins, the score will suffer.
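One quick way to see how balanced (or unbalanced) a particular random split is - a small sketch, not part of the original analysis - is to compare the species proportions in the two subsets:

# compare the species make-up of the training and testing subsets;
# for an unlucky split these proportions can look quite different
print(training_data['species'].value_counts(normalize=True))
print(test_data['species'].value_counts(normalize=True))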
Another way to visualise this randomness is to draw multiple prediction charts for different random selections of training data. We can use the code from the previous part, and put it in a loop:
# ignore some pandas warnings that we don't care about
pd.options.mode.chained_assignment = None

# make 10 different prediction grids with the same value of k
bill_lengths = np.arange(35, 60, 1)
flipper_lengths = np.arange(170, 230, 1)

data = []
for iteration in range(10):
    # randomly assign the rows to either training or test
    df['assignment'] = np.random.choice(['training', 'testing'], 333, p=[0.8, 0.2])
    training_data = df[df['assignment'] == 'training']
    test_data = df[df['assignment'] == 'testing']

    for bill_length in bill_lengths:
        for flipper_length in flipper_lengths:
            guess = guess_species(bill_length, flipper_length, 3)
            data.append((bill_length, flipper_length, guess, iteration))

data = pd.DataFrame(data, columns=['bill_length', 'flipper_length', 'prediction', 'iteration'])

# plot the predictions on multiple charts
sns.relplot(
    data=data,
    x='bill_length',
    y='flipper_length',
    hue='prediction',
    hue_order=['Adelie', 'Gentoo', 'Chinstrap'],
    col='iteration',
    col_wrap=5,
    height=2,
    edgecolor='none'
)
Here I’ve used k=3 to make the effect easier to see in small charts. Notice how the outlines of the prediction regions are slightly different in each iteration, due to the random sampling of data points in the training set, even though we are keeping k exactly the same. Chapter 8 of Biological Data Exploration has details of how to create this kind of small multiple plot.
We will finish this part of the series here. In the next part, we’ll pick up the idea of scoring a classifier and look at a few ways to get round the problem of randomness. If you don’t want to miss it, sign up for the Python for Biologists newsletter just below this article.